14 research outputs found

    Enriching the transformer with linguistic factors for low-resource machine translation

    Introducing factors, that is, word features such as linguistic information associated with the source tokens, is known to improve the results of neural machine translation systems in certain settings, typically in recurrent architectures. This study proposes enhancing the current state-of-the-art neural machine translation architecture, the Transformer, so that it can incorporate external knowledge. In particular, our proposed modification, the Factored Transformer, uses linguistic factors that insert additional knowledge into the machine translation system. Apart from using different kinds of features, we study the effect of different architectural configurations. Specifically, we analyze the performance of combining words and features at the embedding level or at the encoder level, and we experiment with two different combination strategies. With the best-found configuration, we show improvements of 0.8 BLEU over the baseline Transformer in the IWSLT German-to-English task. Moreover, we experiment with the more challenging FLoRes English-to-Nepali benchmark, which includes both an extremely low-resourced setting and very distant languages, and obtain an improvement of 1.2 BLEU. This work is supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 947657).
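
    The abstract describes combining word embeddings with linguistic-factor embeddings at the embedding level, with two combination strategies. The sketch below shows one plausible way to implement such a factored embedding layer in PyTorch (summation versus concatenation followed by a projection); the class name, dimensions, and strategy details are illustrative assumptions, not the paper's actual code.

```python
import torch
import torch.nn as nn

class FactoredEmbedding(nn.Module):
    """Combine word embeddings with embeddings of linguistic factors
    (e.g. POS tags or lemmas) before feeding them to a Transformer encoder."""

    def __init__(self, vocab_size, factor_vocab_size, d_model, d_factor, mode="concat"):
        super().__init__()
        self.mode = mode
        self.word_emb = nn.Embedding(vocab_size, d_model)
        if mode == "sum":
            # Factors share the model dimension and are added to the word vectors.
            self.factor_emb = nn.Embedding(factor_vocab_size, d_model)
        else:
            # Factors get a small embedding that is concatenated and projected back.
            self.factor_emb = nn.Embedding(factor_vocab_size, d_factor)
            self.proj = nn.Linear(d_model + d_factor, d_model)

    def forward(self, words, factors):
        w = self.word_emb(words)       # (batch, seq, d_model)
        f = self.factor_emb(factors)   # (batch, seq, d_model or d_factor)
        if self.mode == "sum":
            return w + f
        return self.proj(torch.cat([w, f], dim=-1))


# Hypothetical usage: 10k-word vocabulary, 32 POS tags, model dimension 512.
emb = FactoredEmbedding(vocab_size=10000, factor_vocab_size=32, d_model=512, d_factor=16)
words = torch.randint(0, 10000, (2, 7))
factors = torch.randint(0, 32, (2, 7))
out = emb(words, factors)   # (2, 7, 512), ready for a standard nn.TransformerEncoder
```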

    Transfer Learning with Shallow Decoders: BSC at WMT2021’s Multilingual Low-Resource Translation for Indo-European Languages Shared Task

    This paper describes the participation of the BSC team in WMT2021's Multilingual Low-Resource Translation for Indo-European Languages Shared Task. The system addresses Subtask 2: Wikipedia cultural heritage articles, which involves translation in four Romance languages: Catalan, Italian, Occitan and Romanian. The submitted system is a multilingual semi-supervised machine translation model. It is based on a pre-trained language model, namely XLM-RoBERTa, which is later fine-tuned with parallel data obtained mostly from OPUS. Unlike other works, we only use XLM-RoBERTa to initialize the encoder and randomly initialize a shallow decoder. The reported results are robust and the system performs well for all tested languages.
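
    A minimal sketch of this kind of setup using the Hugging Face transformers library: the encoder is initialized from a pre-trained XLM-RoBERTa checkpoint, while the decoder is a shallow, randomly initialized stack with cross-attention. The checkpoint name and the two-layer decoder depth are assumptions for illustration, not the submitted system.

```python
from transformers import (
    AutoConfig,
    AutoModel,
    AutoModelForCausalLM,
    EncoderDecoderModel,
)

# Pre-trained multilingual encoder.
encoder = AutoModel.from_pretrained("xlm-roberta-base")

# Randomly initialized shallow decoder (depth of 2 is an assumption) with cross-attention.
decoder_config = AutoConfig.from_pretrained("xlm-roberta-base")
decoder_config.is_decoder = True
decoder_config.add_cross_attention = True
decoder_config.num_hidden_layers = 2
decoder = AutoModelForCausalLM.from_config(decoder_config)

# Tie the two halves into one seq2seq model.
model = EncoderDecoderModel(encoder=encoder, decoder=decoder)
model.config.decoder_start_token_id = encoder.config.bos_token_id
model.config.pad_token_id = encoder.config.pad_token_id
# The model can then be fine-tuned on parallel data, e.g. pairs gathered from OPUS.
```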

    Pre-trained biomedical language models for clinical NLP in Spanish

    This work presents the first large-scale biomedical Spanish language models trained from scratch, using large biomedical corpora totalling 1.1B tokens and an EHR corpus of 95M tokens. We compared them against general-domain and other domain-specific models for Spanish on three clinical NER tasks. Our models are superior across all NER tasks, making them better suited to clinical NLP applications. Furthermore, our findings indicate that, when enough data is available, pre-training from scratch is better than continual pre-training when tested on clinical tasks, raising an exciting research question about which approach is optimal. Our models and fine-tuning scripts are publicly available on HuggingFace and GitHub. This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.
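
    As a hedged illustration of how such a released model might be applied to clinical NER with the Hugging Face transformers library: the checkpoint identifier below is a placeholder (the abstract does not give the exact model names), and the sketch assumes a checkpoint already fine-tuned with a token-classification head.

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Placeholder checkpoint name: substitute the authors' published model from HuggingFace.
MODEL_ID = "your-org/biomedical-spanish-roberta-ner"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(MODEL_ID)

# Tag a Spanish clinical sentence token by token.
text = "El paciente presenta fiebre y cefalea persistente."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
pred_ids = logits.argmax(dim=-1)[0].tolist()
tags = [model.config.id2label[i] for i in pred_ids]
print(list(zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), tags)))
```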

    A pipeline for large raw text preprocessing and model training of language models at scale

    The advent of Transformer-based (i.e., self-attention-based) language models has revolutionized the entire field of Natural Language Processing (NLP). Once pre-trained on large, unlabelled corpora, these models can be transferred to virtually all downstream tasks; the paradigmatic example is the BERT model. Recent works have proposed alternative pre-training algorithms and neural architectures to improve the efficiency or performance of the models, and the ecosystem of frameworks for using them has flourished. Nevertheless, less attention has been paid to the practical issues of preparing new corpora for pre-training language models and training them effectively from scratch on High-Performance Computing (HPC) clusters. Preprocessing new corpora is critical for languages and domains that do not have enough published resources, and the practical details of training language models from scratch are less well documented than those of fine-tuning existing models. Moreover, when data quality is high enough, language- and domain-specific language models have already been shown to outperform their multilingual and general-domain counterparts, at least in some cases. This project consists of developing a preprocessing and training pipeline for generating language models at scale, especially targeting under-resourced languages and domains. The preprocessing pipeline's crucial role is to clean raw text and format it as needed while preserving document-level coherency (where possible) so that models can learn long-range dependencies. Most existing data gathering and cleaning methods for NLP have focused more on quantity than on quality. Since our approach aims to be compatible with low-resource languages and domains, the filtering should be as fine-grained as possible, at the risk of otherwise losing useful data. Unlike other works, we put special emphasis on the generation of resources for training these models. Regarding training, learning large models from scratch presents several challenges even when leveraging existing libraries: apart from adapting to the specifics of an HPC cluster and choosing hyperparameters carefully, the training procedure should ideally be relatively resource-friendly. We show the application of our system to generating new corpora in real-world use cases and how these data can be used effectively for training models from scratch.
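
    A minimal sketch of the kind of document-level cleaning step described in the abstract; the thresholds and heuristics are illustrative assumptions, not the project's actual rules.

```python
import re
import unicodedata
from typing import Optional

MIN_WORDS_PER_LINE = 3       # drop boilerplate-like fragments (assumed threshold)
MAX_NON_ALPHA_RATIO = 0.3    # drop lines that are mostly symbols or markup (assumed)

def clean_document(raw_text: str) -> Optional[str]:
    """Normalize and filter one raw document, keeping its surviving lines
    together so that document-level coherence (and hence long-range
    dependencies) is preserved."""
    text = unicodedata.normalize("NFC", raw_text)
    kept = []
    for line in text.splitlines():
        line = re.sub(r"\s+", " ", line).strip()
        if not line:
            continue
        non_alpha = sum(not ch.isalpha() and not ch.isspace() for ch in line) / len(line)
        if len(line.split()) < MIN_WORDS_PER_LINE or non_alpha > MAX_NON_ALPHA_RATIO:
            continue  # fine-grained filtering: drop the fragment, not the whole document
        kept.append(line)
    return "\n".join(kept) if kept else None  # an empty result discards the document

print(clean_document("Menu | Login | Search\nThe patient cohort was followed for two years.\n<div>"))
```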

    Neural Machine Translation and Linked Data

    Machine translation is the task of automatically translating from one language into another. By linked data, we mean structured, interlinked data. Currently, machine translation is addressed by means of deep learning techniques. These algorithms have the major drawback of requiring large quantities of data for training. BabelNet (an example of linked data) consists of concepts and named entities connected across 271 languages. This project, building on recent work, aims to exploit BabelNet in the latest neural machine translation systems. In particular, it proposes extracting concepts from BabelNet and using them alongside words in order to improve machine translation, particularly in low-resource settings. Classical linguistic features are tested as well.
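
    A minimal sketch of the proposed idea under an assumed interface: given a pre-computed mapping from surface words to BabelNet synset IDs (obtained separately, e.g. through BabelNet's API), build a factor sequence aligned with the word sequence so that both can be fed to a factored NMT system. The function name and the synset IDs in the example are placeholders.

```python
from typing import Dict, List, Tuple

def attach_concepts(tokens: List[str], synsets: Dict[str, str],
                    unk: str = "bn:unknown") -> Tuple[List[str], List[str]]:
    """Return the token sequence plus a parallel sequence of concept factors,
    one per token, falling back to a placeholder ID for unknown words."""
    factors = [synsets.get(tok.lower(), unk) for tok in tokens]
    return tokens, factors

# Illustrative lookup table; the synset IDs are placeholders, not real BabelNet entries.
lookup = {"bank": "bn:00000001n", "river": "bn:00000002n"}
words, concepts = attach_concepts("The bank of the river".split(), lookup)
print(list(zip(words, concepts)))
```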

    Semantic and syntactic information for neural machine translation: Injecting features to the transformer

    Introducing factors such as linguistic features has long been proposed in machine translation as a way to improve translation quality. More recently, factored machine translation has proven to still be useful for sequence-to-sequence systems. In this work, we investigate whether these gains hold for the state-of-the-art architecture in neural machine translation, the Transformer, rather than for recurrent architectures. We propose a new model, the Factored Transformer, to introduce an arbitrary number of word features into the source sequence of an attentional system. Specifically, we suggest two variants depending on the level at which the features are injected, and two mechanisms for combining the word features with the words themselves. We experiment with both classical linguistic features and semantic features extracted from a linked-data database, on two low-resource datasets. With the best-found configuration, we show improvements of 0.8 BLEU over the baseline Transformer in the IWSLT German-to-English task. Moreover, we experiment with the more challenging FLoRes English-to-Nepali benchmark, which includes both low-resource and very distant languages, and obtain an improvement of 1.2 BLEU. These improvements are achieved with linguistic, not semantic, information. This work is supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (Grant Agreement No. 947657).
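
    Besides the embedding-level combination sketched earlier in this listing, the abstract mentions injecting features at the encoder level. The sketch below shows one plausible reading of that variant (not necessarily the paper's exact design): words and factors are encoded by separate Transformer encoders and their outputs are combined position-wise before a decoder attends to them.

```python
import torch
import torch.nn as nn

class EncoderLevelFactors(nn.Module):
    """Encode words and linguistic factors in separate streams and combine the
    encoder outputs position-wise (an assumed interpretation of encoder-level
    feature injection)."""

    def __init__(self, vocab_size, factor_vocab_size, d_model=512, nhead=8, layers=2):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.factor_emb = nn.Embedding(factor_vocab_size, d_model)
        self.word_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers=layers)
        self.factor_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers=layers)

    def forward(self, words, factors):
        # A decoder would cross-attend over this combined representation.
        return self.word_encoder(self.word_emb(words)) + self.factor_encoder(self.factor_emb(factors))

enc = EncoderLevelFactors(vocab_size=8000, factor_vocab_size=50)
out = enc(torch.randint(0, 8000, (2, 6)), torch.randint(0, 50, (2, 6)))
print(out.shape)  # torch.Size([2, 6, 512])
```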